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Abstract 

A compact scheme is a discretization scheme that is advantageous in obtaining highly 
accurate solutions. However, the resulting systems from compact schemes are tridiago- 
nal systems that are difficult to solve efficiently on parallel computers. Considering the 
almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), 
is proposed. The SPP algorithm requires less memory than the conventional LU de- 
composition and is efficient on parallel machines. It consists of a prefix communication 
pattern and AXPY operations. Both the computation and the communication can be 
truncated without degrading the accuracy when the system is diagonally dominant. A 
formal accuracy study has been conducted to provide a simple truncation formula. Ex- 
perimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 
2 vector machine. Experimental results show that the simple parallel prefix algorithm 
is a good algorithm for symmetric, almost symmetric Toeplitz tridiagonal systems and 
for the compact scheme on high-performance computers. 
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1 Introduction 


Recent technological advances have made it possible to build computers that contain thousands of 
processors and can obtain gigaflops (10 9 floating-point operations per second) on real applications. 
Emerging parallel computers are designed to solve large problems and achieve better accuracy 
than could previously be obtained [22]. Parallel computers demand new models, new discretization 
methods, and new algorithms to explore the potential of high-performance computing. 

Conventionally, partial differential equations (PDEs) are discretized by finite-difference or finite- 
element methods, and are solved by Gauss-Seidel, conjugate- gradient, or successive over relaxation 
(SOR) methods. A new discretization method, the compact finite- difference scheme (or compact- 
difference scheme , compact scheme) was proposed by Kreiss and Oliger [12] and was later im- 
proved upon by Lele [15]. Compared with the traditional finite-difference scheme, the compact 
finite-difference scheme achieves higher accuracy with smaller difference stencils and leads to more 
accurate approximations because of the smaller coefficients on the truncation error. With these ad- 
vantages, the compact scheme has quickly gained popularity. In practice, the resulting discretized 
system of the compact schemes are tridiagonal systems that can be solved efficiently on sequential 
machines. However, tridiagonal systems are difficult to solve efficiently on parallel computers. For 
example, to study the physics of compressible homogeneous turbulence, the CDNS (compressible 
direct simulation of Navier- Stokes) code, based on a sixth-order compact scheme and a third-order 
time discretization, was implemented on the Intel Delta [7]. After carefully choosing an existing 
tridiagonal solver, mapping the algorithm to the architecture, and overlapping communication with 
computation, the communication overhead consumed about 30 percent of the total execution time. 
Clearly, more efficient algorithms are needed to explore the potential of compact schemes on parallel 
computers. 

In recent years, intensive research has been done to develop efficient tridiagonal solvers on 
parallel computers. A good survey can be found in references [17], [9], and [13]. Of the known 
tridiagonal solvers, both the recursive-doubling reduction method (RCD) developed by Stone [19], 
and the odd-even, or cyclic, reduction method (OER) developed by Hockney [10] are able to solve an 
n-dimensional tridiagonal system in 0(log(n)) time using n processors. These methods are designed 
for fine-grain computing. Substructured methods developed by Lawrie and Sameh [14], Wang [27], 
and Sun, Zhang, and Ni [26] were designed for median-grain and coarse-grain computing (i.e., the 
case of p < n or p « n, where p is the number of processors available). Lawrie and Sameh’s 
algorithm is designed for shared-memory machines; Wang’s algorithm is designed for distributed- 
memory machines; and Sun et al. proposed three different algorithms, each of which may be a better 
choice depending on the problem and the machine. For compact schemes, the tridiagonal systems 
have a special structure that consists of diagonal dominance and are almost symmetric Toeplitz. 
For this special structure, a parallel tridiagonal solver for fine-grain computing, the simple parallel 
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prefix (SPP) algorithm, is proposed in this paper. It shows that compact schemes can be solved 
efficiently on parallel computers. Since the same special structure appears in many other scientific 
applications, such as alternating direction implicit method [21], wavelet collocation method [2], 
spline curve fitting [4], etc., the importance of the SPP algorithm is beyond compact schemes. 

The SPP method is simple to implement. It requires only 2 • log(n) AXPY (vector plus scalar 
times vector) operations and prefix communication patterns. If the tridiagonal system is diagonally 
dominant, then the AXPY operations can be truncated after a certain number of steps without 
degrading the accuracy. A formal accuracy analysis is conducted and simple formulas are provided 
to compute the number of AXPY operations necessary. 

This paper is organized as follows. Section 2 will present the compact scheme and discretized 
tridiagonal systems. Section 3 will introduce three versions of the simple parallel prefix algorithm: 
the SPP for tridiagonal systems with the given special structure, the SPP for solving symmetric 
Toeplitz tridiagonal systems, and the SPP for solving almost symmetric Toeplitz systems. Accuracy 
analysis will be conducted in Section 4. Section 5 will give experimental results on a 16K processing 
elements (PEs) MasPar SIMD computer and on a Cray 2 supercomputer. Finally, Section 6 will 
give the conclusions. 


2 Compact Finite-Difference Schemes 

With either conventional finite-difference or finite-element discretization methods, as the order of 
the approximation increases, the required number of boundary and near-boundary relations and 
the required number of mesh points per derivative stencil increases accordingly. To achieve higher 
accuracy with less additional mesh points, the compact scheme [12] was introduced. As originally 
suggested by Kreiss and Oliger [12], and later discussed for fluid dynamics problems by Hirsh [8], 
the first and second derivatives for compact differences may be approximated by 
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and h x is the mesh spacing, which is constant for simplicity. By multiplying (1) by the respective 
denominators, relations for the derivatives may be found, which yield 
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and 


(3) 


22 /n- 1 + g/n + J^/n+1 “ J^ 2 ^ n + l - n + fn-l)- 

These equations yield tridiagonal systems when the appropriate boundary conditions are applied. 

To make an accurate comparison between the compact- difference (equations (2) and (3)) and the 
standard central-difference schemes, Taylor-series expansions are employed. As Hirsh has shown, 
the truncation error for the compact differences are 

E (fn) = ~^htf^ and E (0 = -±hifW. 

Similar error analyses for the central differences yield 

E(fn) = -^ h tf iv} and E iO = 

Although both schemes are fourth-order accurate, the compact- difference scheme should lead to 
more accurate approximations as a result of the smaller coefficients on the truncation error. Similar 
results hold for other higher order approximations. 

As yet, no mention has been made about the boundary treatment for the compact scheme. 
At the boundaries, Hirsh [8] experimented with a variety of boundary conditions, and Adams [1] 
suggested a boundary relation that includes near-boundary derivatives in the formulation. The 
boundary conditions used by both Hirsh and Adams retained the tridiagonal nature of the system. 
Demonstrated some 30 years ago, the boundary-condition stencil can take the form 


c o/o + fi — afo + (3fi + 7/2 + 


(4) 


where co = l,<a=^,/3 = —a, and 7 = e = 0 for a second-order boundary condition; Co = y, 
a = /? = 7 = and £ = 0 for a third-order boundary condition; and Co = a = 

(3 = 7 = 2 ^-, and e = for a fourth-order boundary condition. Additional high order stencils 
have been described by Carpenter, Gottlieb, and Abarbanel [3]. To demonstrate the SPP algorithm, 
for simplicity, explicit fourth-order one-sided finite differences will be used for boundary conditions. 

Many relevant fluid dynamics applications can make use of high-order compact- difference oper- 
ators to numerically solve the governing systems of equations. For example, Burger’s equation, the 
boundary- layer equations, and the driven cavity problem were solved by Hirsh [8] with compact- 
difference operators. Further, Joslin et al. [11] used the compact- difference equations (2) and (3) 
to numerically solve the fully nonlinear Navier-Stokes equations of fluid dynamics. 
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where u\ — (rq, ^ 2 , ^ 3 ) are the velocity components that correspond to the steamwise, normal, and 
spanwise directions, X{ = (oq, £ 2 ? # 3 ); v is the kinematic viscosity, p is the density, and repeated 
indices infer a summation over the index. To demonstrate how compact- difference operators could 
be employed to solve the nonlinear PDE’s (5) and (6), consider the model problem of the one- 
dimensional heat-conduction equation: 

ffo 1 1 1 

(7) 


du d 2 u 

dt dx 2 


To solve this equation computationally, discretizations in time and space must be chosen. From 
the Taylor-series expansions in time, one derives the discrete equation 


— o, n 


u n + nAt 


d 2 u n 
dx 2 ’ 


( 8 ) 


where n corresponds to a time level. If the result at level u n is known, then the solution u n+1 can be 
obtained if d 2 u n jdx 2 can be determined. The spatial derivative can be computed with the second- 
derivative operator (3). Each time-step advancement requires a single compact- difference solve. In 
the original nonlinear PDE systems (5) and (6), an explicit or semi-implicit time discretization will 
lead to both first and second order derivative operators evaluated on previous time-step information. 
On standard cartesian grids, the resulting discrete system requires differential operator solves which 
are scalar matrices and are similar to the model problem (8). Because time- marching is necessary, 
compact-difference solves are required at each time level. This necessitates a fast compact- difference 
solver. By observation, equations (2) and (3) take the matrix form 


Co 1 
1 c 1 


x = d 


1 


c 1 


1 


(9) 


where x = {/', /"} and c = {4, 10} correspond to the compact- difference parameters. The first and 
last rows of equation (9) arise from boundary conditions. 

With higher orders of approximation, the resulting matrix will differ only in the boundary 
conditions, however, the resulting tridiagonal systems can be written in the almost symmetric 
Toeplitz form described in the next section, Eqn. (10), where A is given by (12). 
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3 The Simple Parallel Prefix Algorithm 

We are interested in solving a tridiagonal linear system of equations 


Ax = d. (10) 

In this system of equations, A is either a symmetric Toeplitz tridiagonal system of order n 

c 1 
1 c 1 

A= • • • =[l,c,l], (11) 

• • 1 

1 c 

or an almost symmetric Toeplitz tridiagonal system of order n 

c 0 1 
1 c 1 

• • 1 

1 c' c 

where x = (aq, • • • , x n ) T and d = (d\ • • • , d n ) T are n-dimensional vectors. We assume that matrix 
A is diagonally dominant (i.e., \c\ > 2). Although we assume that A, x, and d have real coefficients, 
the extension to the complex case is straightforward. 

3.1 The Simple Prefix Method 

Before solving symmetric systems and almost symmetric Toeplitz systems with two boundary condi- 
tions, the simple prefix method is first introduced to solve a special kind of one-boundary-condition 
systems typical of hyperbolic systems. The simple prefix method, then, can be modified for more 
general situations. In this section, we study how to efficiently solve the system 

Ax = d (13) 
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on parallel computers, where 


A = 


fa 1 

1 c 
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= a- [6,1,0] -[0,1,6], 


and a and b are the real solutions of: 


a b — c 
a • b = 1 


(14) 


For a given value of c, the relations in (14) define the values of a and b. In general, the value 
obtained for a would lead to an inconsistency with the coefficients of the boundary stencils given 
in equation (4). However, for the fourth-order compact- difference scheme described here, which 
will require at least third-order accurate boundary conditions, the third and fourth order boundary 
conditions shown in equation (4) can be combined to yield a one-parameter family of boundary 
conditions, 


a /o + fi ~ ( — (15 + 2a) /o + (12 — 3cr)/i + (3 + 6a) f 2 — afs ) , 

where a = 2{i-cr) • va l ue °f a is used to determine a. Note, a = 0 and a = 1 correspond to the 
original third and fourth order conditions of equation (4). Because a • b = 1 and \c\ > 2, we may 
further assume that \a\ > 1 and |6| < 1. By equation (13), 

x = a~ l • [0, 1, 1, 0 ]~ l d = b • [0, 1, 1, 0] -1 d. 


Let L = [—6, 0, 0]. Then 


[fr, 1, 0] = [0, 1, 0] — [—6, 0, 0] = I — L 


and 


[b, 1, 0] -1 = (I + L + L 2 + --- +L n ~ l ) (15) 

= (/ + L 2nsnl_1 )(/ + L 2risnl “ 2 ) (I + L 4 )(I + L 2 )(I + L). (16) 

Note that n is the dimension of matrix A and that equations (15) and (16) can be verified directly. 
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The superscript of matrix L represents matrix multiplication 


0 0 
(-by o 


\ 0 0 (-6)* 0 0/ 

where the first nonzero element (—6)* is at position (i + 1 , 1). Similarly, let U = [0,0, — b ]. Then, 


[0,1, b\ = [0,1, 0] - [0, 0,-6] =I-U, 


where 


[0,1, ft]" 1 = (I + U + U 2 + ••• + U n ~ l ) 

= (i + u 2 ^- 1 ) (. I + U 2 )(I + U ), 


o o (-by o o 


0 0 (- 6 )* 
0 0 


The first nonzero element of U l is at position (1 , i + 1). Thus, the solution of equation (13) is 

X = b-(I + U 2f ' en] ~ 1 )---(I + U) ■ (/ + L 2nsnl_1 ) (I + L)d. (17) 

Let v = (iq, 1 * 2 , • • • , v n ) T be an n-dimensional vector. Given the special structure of i?, we find 

(I + L l )v = v + ( -bfvti ) , 


where 


V(i) = (0, •••0,U1 


and v\ is the i + 1 element of v^y Similarly, given the special structure of ? 7 Z , we find 


(I + U l )v = « + (-6)V < ), 
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where 


V (t) = ,v n ,0---0) T . 


Equation (17) shows that equation (13) can be solved with 2 • fig n] AXPY operations. Because 
|b| < 1, ||L*|| 0 and ||?7 Z || — >► 0 when n oo, the AXPY operation may be truncated without 

influencing the accuracy. Formulas will be given in Section 4 to compute the smallest truncation 
integer k. A sequential code for solving equation (17) within truncation error is given in figure 1. 


for % <— 0 to k — 1 do 

for j <— 2 l to n do 

dj = dj + {—b) 2% dj_2i 

j <-j + 1 

i <— i + 1 

for % 0 to k + 1 do 

for j 1 to n — 2 l do 

dj = dj + (— b) 21 dj+ 2 i 

3 <- 3 + 1 

i i + 1 

for j 1 to n do 

Xj = b • dj 

3 3 + 1 

Figure 1. The simple prefix method. 


If n processing elements are available and dj is stored in processor j, then the two for loops of 
i in figure 1 will lead to prefix computations. Figure 2 shows the prefix computation pattern that 
correspond to the second for loop of % when n is equal to 8. The first for loop of % in figure 1 leads 
to a similar prefix computation pattern, except that the communication is from right to left. Prefix 
(or recursive doubling) computation is a widely used computation model in scientific computing. 
Any linear recursive relation can be computed by recursive doubling [23]. A recursive-doubling 
algorithm exists for solving tridiagonal systems and involves matrix- matrix multiplications [6, 26]. 
Compared with the existing recursive-doubling algorithm, our method achieves a smaller operation 
count by adopting the “vertical prefix” computation, in which the matrix- vector multiplication in 
equation (17) is conducted in a parallel prefix fasion. Compared with the cyclic reduction method 
[10], the proposed prefix method has a simpler communication pattern. We call the prefix method 
given in figure 1 the simple prefix method. Figure 3 shows the communication pattern of the widely 
used cyclic reduction method [10]. In a comparison of figures 2 and 3, we can see that the prefix 
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method has a relatively simple communication pattern. Both of the SPP algorithm and the cyclic 
reduction method need 2 x log(n ) communication and computation steps, and both can truncate 
these steps if the system is diagonally dominant. The SPP algorithm has a slightly lower bound 
than that of the existing algorithm on the solution accuracy for the truncated algorithm, in the 
case of diagonally dominant systems. On the other hand, the proposed prefix method also has its 
limitation. It is only feasible for solving almost Toeplitz tridiagonal systems. 



Figure 2. Communication of prefix computation. 
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Figure 3. Communication of cyclic reduction method. 

Fast sine transform (FST), which can be implemented efficiently on certain parallel machines, 
can also be used to solve symmetric Toeplitz tridiagonal systems. The FST method has a similar 
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communication pattern (see Fig. 4) and similar computation count 1 as the SPP algorithm on 
fine-grain parallel computers. However, the communication and computation of the FST method 
cannot be truncated even the linear system is diagonal dominant. 



Figure 4. Communication of Fast sine transform method. 


3.2 Modification of Symmetric Toeplitz System 

Our goal is to find the solution of equation (10). Modification is needed to convert the solution of 
equation (13) to the desired solution. The modification will be different for symmetric Toeplitz sys- 
tems and for almost symmetric Toeplitz systems. For a given symmetric Toeplitz system, equation 
(13) is modified as 

A = A-h AA = A + VE t , 

where 

6 0 \ 

0 0 0 

A A= , (18) 

0 0 0 

V 0 °) 

1 Here we assume that the n unit roots are readily available. Otherwise, the FST method will have a high 
initialization time. 
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and V = (6, 0, • • • , 0) T and E = (1, 0, • • • , 0) T are n-dimensional vectors. By the matrix- modification 
formula [18, 5, 26], equation (10) can be solved by 


x = A~ l d= (A + VE T )~ 1 d 
x = A~ l d- A~ l V{I + E T A- l V)- l E T A~ l d 
= x- A~ l V(I + E t A~ 1 V)- 1 x 1 , 


where x\ is the first element of vector x. If the calculation approach given in [21] is followed, we 
have 


(I^E t A- 1 V)~ 1 = 


1 


and 


Thus 


1 + EL1&* 

A~ l V = (E b 2i , E (~ b )b 2i , E (~b) 2 b 2i , • • • , (—b) n+1 ) T . 


n n — 1 


n — 2 


i = 1 z=l 


i = 1 


1 - 6 2n 1 - . 1 - 0 n ~^ , (1 - b 2 ) 

fu\ iuM ( U\n- 1 V- 1 - u > 


= b 2 


;A~b) 




(19) 


1 _ b2(n+l) ’ V ^ 1 _ b 2(n+l) ’ ’ v 1 _ &2(n+l) ’ ’ v 1 _ &2(n+l) 

\ / 

The final solution is 

x — x — x\z , (20) 


where vector 2 : is the right side of equation (19). Because |6| < 1, 2 ; can be truncated at some 
integer k\ without affecting the accuracy (see section 4). Furthermore, when n is large, b 2 ^ n ~ lJrl \ 
i = 0, 1, • • • , fci, will be less than machine accuracy, and £ reduces to 


2 = ((—b) 2 , (~b) 3 , • • • , (— 6) fcl+2 , 0, • • • , 0) T . 


The program of modification is given in Figure 5, and the algorithm for solving the symmetric 
Toeplitz tridiagonal system is given in Figure 6. 


for i 1 to k\ do 

X — X{— X\Z{ 

% <— % + 1 

Figure 5. Modification for symmetric Toeplitz systems. 
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Step 1: 


Use the simple prefix method to find the solution to equation (13). 

Step 2: 

Use the modification equation (20) to obtain the final solution. 
Figure 6. Algorithm for solving symmetric Toeplitz system. 


3.3 Modification of Almost Symmetric Toeplitz System 

A similar modification can be given to solve almost symmetric Toeplitz systems. For almost sym- 
metric Toeplitz systems, equation (13) is modified such that 

A = A + AA = A + VE t , 


where 


ci 


A A = 


( 21 ) 


ci = cq — a, C 2 = Cq — c, and 


C2/ 


V = 


1 0 ... 0 

0 ... 0 1 


E = 


ci 0 ... 0 

0 ... 0 c 2 


Following the matrix modification formula [26, 21], the solution to equation (10) becomes 


x 


= A~ l d = A~ l d - A~ l V{I + E t A~ 1 V)- 1 E t A~ l d 


= x- A~ l V (I + E T A~ l V)~ l 


ClXi 

C2%n 


where x\ and x n are the first and last element of 5, respectively. With this new modification, 
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( I + E 1 A _1 V) _1 = 


(-1 ) n c 2 b n l + c 2 b 


The final solution is 


X = x- biy 1 - b 2 y 2 , 
where &i, 62 are the solutions of the reduced 2x2 system. 


61 ] =(/ + ^ T i- 1 C)- 1 ( Cl£l 

6 2 / \ C 2 X n 


(22) 


(23) 


Similar to the symmetric case, when n is large, \b\ n may be less than machine accuracy. In 
these cases, 


-1 


(I + E T A^V)- 1 = 1 1 + a ~ b 


1 + c 2 b 


and 


l»\ _ I 


b2 


_C2_ 


l+c 2 b Xn 

Thus, when n is large, the final solution can be computed through a simplified form, 


C\X\ C2X n _2 

x = x y — - -y 

a — b-\- c\ 1 + C 20 


where 


1 -b 

(-1 ) n ~ i b n (- i )™- 2 ^- 1 


(~by~ l ■ ■ • ( -b) n ~ l 
(-1 ) n -ib n -j +1 ••• b 


(24) 


The modification computation for the almost symmetric Toeplitz system is given in figure 7. 
The algorithm for solving the almost symmetric Toeplitz systems is similar to the algorithm for 
solving the symmetric Toeplitz system (see figure 6), except in step 2 the new modification (figure 
7) is used to replace the symmetric Toeplitz system modification. Notice that when Co = a and 
Cq = c, we have c\ = c 2 = 0 and there is no modification necessary. When cq = c and c' 0 = c, we 
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have ci = 6, C 2 = 0, and figure 7 is equivalent to figure 5. 


for i 4— 1 to n do 


X = Xi — 


ClXl ~l _ C 2 X n f.2 
a-b-\-ci^i l+c 2 b^i 


i <— i + 1 


Figure 7. Modification for almost symmetric Toeplitz system. 


4 Accuracy Analysis 

In a previous section, the claim was made that the AXPY operations and modifications can be 
truncated without influencing the accuracy. In this section, we will study the truncation accuracy 
of symmetric Toeplitz systems. For simplicity, we truncate the first for loop and the second for 
loop at the same step k. The norm used in this section is the 1-norm. 

4.1 Accuracy Study of the Simple Prefix Method 

Let x, y be exact solutions as 

y = b- [b, l,0] -1 d, 

= (I + L + L 2 + ■ ■ ■ + L n ~ 1 )b ■ d, 
x = b ■ [0, 1 , b]~ l [b, 1 , 0] -1 ■ d 
= (I + U + U 2 + • + U n ~ 1 )-y. 

Let y* and x f be the solutions given by the first for loop and the second for loop, respectively, 
as 

y* = (/ + -£/ + •••+ L k ~ l ) ■ b - d, 
where k is a power of 2 (k = 2 k ), and 

x' = (/ + [/ + ••• + U k ~ 1 )y*. 

Also, we define as a hypothetical solution by 

x* = (I + U + --- + U k ~ l )y. 
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The difference between x and x* is 


x-x* =(/ + [/ + ••• + U n ~ l )y -(I + U + --- + U k ~ l )y 
= ( U k + U k+1 + • • • + U n ~ l )y 
= (U k + ■■■ + U n ~ 1 ){I -U)x 
= ( U k - U n )x. 

Thus, we find 

< 110*11 • 11/ - U n ~ k II < \b\ k (l + I b\ n ~ k ). (25) 

\\x\\ 

Equation (25) gives the relative error between x and x*. Because the difference between x* and x f 
is 

x* - x f =(/ + [/ + ... + U k ~ l )y -(/ + [/ + ••• + U k ~ l )y * 

= (I + U + --- + U k ~ 1 )(y-y*) 

= (/ + [/ + ••• + U k ~ l )(L k + • • • + L n ~ l ) *b-d 
= (/ + 0 + • • • + U k ~ l ){L k + • • • + L n ~ l ){I — L)(I — U)x 
= (/ + ••• + U - L n )(I - U)x, 

the norm becomes 

■ \ b\ k ■ (1 + | 6|“-*)(1 + | 6 |). ( 26 ) 

If equations (25) and (26) are combined, then the relative error between the exact solution x and 
the truncated solution x f is 



< i&i fc (i + if>r‘xi + ’J'h + 1*!)) 

< 2 • \b\ k • (1 + \b\ n ~ k ). 


4.2 Final Accuracy of Symmetric Toeplitz Systems 

The truncation error of the simple prefix method (Figure 1) will be carried into the modification 
step and will influence the accuracy of the final solution. Let x be the solution of a symmetric 
Toeplitz tridiagonal system (10) and x* be the corresponding solution that has been modified with 
the truncated solution, 


X = X — X\Z, 

* -/ -/ 

X = x — x x z^ 
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where z = b 2 U’ (~ 6 ) i-S”+i) ’ ' ' ' ( _6 ) n 1 i%S+d ) 5 see equation (19). The error gener- 

ated by this truncation is 


x — X*\\ = ||(5 — x) + [x\ — x'i)z\\ 

< ||5 — 5 7 || + ||5i — 5^11 • ||2; 

< ii5-^ii(i + ii^ii). 


The norm of vector 2 : can be computed directly as 


Therefore, 
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l 2 51-1 n-1 

s r^niEw+Eir-) 

z=0 z=0 

^ b 2 (1 + |6| n+1 )(l — \b\ n ) 

~ 1 _ b 2(n+l) • (1 _ | & |) 

< h 2 (1-1 b\ n ) b 2 

- '(1-|6|«+1)(1-|6|) - 1-H 
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a\ — 1 


|a? — rr* 1 1 < ||5 — 5 , ||(1 + — j— ! L-) 

lc| - 1 . 


= \\x — X 


-) 


= \\x — X 


a\ — 1 

n\A ~ H + \b\ 2 
i-H 


)<|| 5 - 5 '||( r -^), 


|5 — 5*|| ^ ||5 — x'\\ . 1 — \b\ + \b \ 2 . 

” ” ” ” v i iTi / 
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.,*11 ’ i-H 

5 — x'\\ ||5|| 1 — |6| + \b\ 2 




i-H 


)• 


(27) 

(28) 

(29) 

(30) 

(31) 


The only unknown in the right side of equation (31) is where 5 is the solution of equation (13) 
and x is the solution of equation (10). For a symmetric Toeplitz system, 


A = A + AA, 
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where A A is given by equation (18). If equations (10) and (13) are combined, we have 


and 


After several calculations, we find 


(A + A A)x = Ax, 

(. I + A~ 1 AA)x = x, 

M<||7 + i-'AA||. 


( 1 -b b 2 • (- b ) n “ 1 \ ( 1 

1 -b ■ (—b) n ~ 2 


A" 1 = b 


1 -b 

1 


-b 1 
b 2 -b 1 


V (-&) 


n — 1 


-b l J 


and 


A^aa = 


E?= 1 6 * 

0 • 

• 0 \ 


0 • 

• 0 

EL“i 2 (-^) 2 ^ 

0 • 

• 0 

v (—b) n+1 

0 • 

• 0/ 


(32) 


Therefore, 


IIa-'aah 


< v” -1 

— Ai = o 




^ b 2 (i+i6i" +i )(i-i6r) 
- 1 ~b 2 ' (1-|6|) 


and 


II I + A lAA \\ < 1 + ^ _ 6 2) 


(1 + |6| n+1 )(l — \b\ n ) 

(1 ~\b\) 


The relative error of the solution to equation (10) is 


x — x*\\ \\x — x'\\ \\x\\ 1 

\\x\\ — \\x\\ \\x\\ 1 — |6| 

^ |6| fe (l + \b\ n ~ k ) ( (1-|6|*)(1 + |6|) 

1 ~\b\ V 1 - \b\ 

|6|*(1 + |6|”~* ) f (1-|6|*)(1 + |6|) 

1 -\b\ V 1-H 


1 + 
1 + 


b 2 ( 1 + |6| n+1 )(l - \b\ n ) 
(1 — b 2 )(l — |6|) 

b 2 \ 
{l-b 2 )(l - \b\)j ' 


(33) 

(34) 

(35) 
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In addition to the truncation error carried from step 1, the truncated modification will also generate 
truncation error. If 

X = x — x\z. 


(Figures 5 and (20)), then 


x — x f = x — x f — (x\ — x[)z + X\ (z — z). 


The 1-norm is given by 


\x — x'\\ < \\x — x'\ \ + \ \x — x'\ \ • \ \z\ \ + \ \x\ \ • \ \z — z\ 

< \\x — x'lKl + ||5||) + ||x|| • | — 5||, 


which leads to 


In equation (36), 


I rp rp ^ II || rp rp ^ \ \ \ \ rp \ 

| | | | | || || | 


\X\\ \\x\ 


{l + \\z\\) + \\z-z\ 


z-z = 


! _(,2,n+l) £ |i>‘(l 

i=ki 

b ‘ 2 nl_1 

1 — ft2(n+l) 


Y \b i+kl (l-b) 2( - nl ~ j) )\, 


j = o 


(36) 


where j = % — Aq, nl = n — k\. Then 


^ \b\ kl+2 (1 + |6| nl+1 )(l - |6| nl ) 

Z ~ Z " - 1 _ &2(n+i) (1 - |6|) 

|6| fel+2 (1 + |6| nl+1 )(l - |6| nl ) 

- 1 _ 62(nl+l) (1 - |b|) 

l 6 lfci + 2 

- l-\b\- 


Note that ||z|| < | |z| |. so that we have the inequality 

x\\ . ... | 6| fel+2 

Jd + IND + U-^ 


\X — X 


\X\ 


< 


\X — X 


< 


\b\ 


1 -\b\ 


1 + 


1 -\b\ 


1 + 


(1 - 6 2 )(1 - | 6 |) 


+ 


\b\ 


k i+2 


(37) 

(38) 


The error introduced by the truncated modification is insignificant if k\ > k — 2. In practice, we 
can choose k\ = k and use the inequality (35) to compute the truncation number k. 
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5 Experimental Results 


In this section, the performance of the SPP algorithm on the MasPar parallel computer and on the 
Cray vector computer will be presented. 

5.1 Parallel Computing 

The MasPar MP-1 is a distributed- memory massively parallel SIMD computer with a high-speed 
two-dimensional toroidal mesh topology. A control unit (ACU) has a direct connection to all 
the processing elements (PEs) and issues instructions at a 12.5 MHz clock rate. Each processing 
element in the array is a Tbit custom load and storage processor with a minimum of 16 kilobytes 
of memory. Communication is relatively cheap on the MasPar. For example, on the MasPar M-l, 
a double-precision multiplication function is ten times more expensive than sending the product to 
an adjacent PE. 

Table 5.1 gives the computation and communication count of the simple parallel prefix (SPP) 
algorithm based on the algorithm (see figure 6) and the communication pattern (see figure 2). Since 
the SPP algorithm can be used for almost symmetric Toeplize systems, the best sequential algo- 
rithm used is the conventional algorithm, Thomas algorithm [20], the LU decomposition method 
for tridiagonal systems. For symmetric Toeplize systems, a fast method proposed by Malcolm and 
Palmer [16] only requires 5n + 2k — 3 arithmetic operations for solving a single system, where k is 
a decay parameter depending on the diagonal dominancy of the system. The computation savings 
of Malcolm and Palmer’s method is in the LU decomposition. For systems with multiple right 
hand sides, in which the factorization cost is not considered, the Malcolm and Palmer’s method 
and Thomas method have the same computation count. We assume that the AXPY operations are 
truncated after k operations; the modification vector used is either equation (20) or (38), for sym- 
metric and almost symmetric system, respectively. The b 2 \ % = 1, • • • , fc, are computed sequentially, 
or redundantly, on each PE. The modification vector 5 or i = 1, 2, will be computed concurrently 
on different PEs. We count a power- function computation as 8 floating-point operations. So, the 
modification phase costs 10 parallel operations (20 for almost symmetric systems) in total. The 
calculation of b (equation (14)) is not considered. In the computing phase, 2 -log (A;) =2 -k parallel, 
one-to-one communications are required. We use a to represent the one-to-one communication. 
On the MasPar, the one-to-one communication is achieved by using the router command. We use 
(3 to represent the broadcast. On the MasPar, the broadcast is achieved by transferring the local 
data to ACU and then distributing it to all PEs. One broadcast communication is needed in the 
modification phase. Because the tridiagonal systems that arise in the compact scheme have multi- 
ple right sides, the computation and communication count for solving multiple right-side systems is 
also listed in Table 1, where the computation of b 2% and the modification vector are not considered. 
Note that n\ is the number of right sides. 
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Table 1. Computation and communication count of the Simple Prefix Algorit 


tim 


System 

Best 

sequential 

SPP 

Computation 

Communication 

Single system 

8n-7 

5 • k + 10 

2 • k • oi T f3 

Multiple right sides 

(5 n — 3) * nl 

(4 • k + 2) * nl 

(2 • k • a + nl * (3) 


Two sample matrices are chosen to illustrate the performance of the SPP algorithm and to verify 
the theoretical error bounds given in the previous section. Both of the matrices are symmetric 
Toeplitz matrices that arise in the compact schemes. One matrix is 

A 1 = [1,3,1] - AA. 

The other matrix is 

A 2 = [1,10,1] - AA 

Equation (18) defines A A. The corresponding solution of equation (14) is b = 3 ~ 2 V ^ and b = 7 ~^ 45 
for A\ and A 2 , respectively. The error is measured relative to the LU solution. The accuracy 
comparison for solving system A\ is given in figure 8. In this implementation, no truncation 
is implemented in the modification phase, and the prediction formula used is equation (35). In 
solving A 2 , the modification is applied at the modification stage with k 2 = fc, and the prediction 
formula used is equation (38). The accuracy comparison for solving system A 2 is given in figure 9. 
From figures 8 and 9, we can see that the theoretical bound matches the measured results well. 

Speedup is defined as sequential execution time over parallel execution time. Figure 10 shows 
the speedup of the SPP algorithm over the conventional sequential algorithm for solving a single 
system. The sequential algorithm, Thomas algorithm, is based on the LU decomposition, and is fine- 
tuned to take advantage of the almost symmetric Toeplitz structure. All computations are double 
precision. Truncation numbers k = 6 and k = 4 are chosen to achieve double-precision (10 -16 ) and 
single-precision (10 -7 ) accuracy, respectively. The order of matrix is the same as the number of 
PEs available. Because of the high-speed communication, with n equal to IK, 2 K, 4 K, 8 K, and 
16K (where K = 2 10 ), the execution time is not noticeably changed in parallel processing. The 
sequential algorithm is implemented on a single PE. Because of memory limitations, only small 
systems are solved by the sequential algorithm. The data used in figure 10 is predicted based on 
the small-size timing. Figure 11 shows the corresponding speedup of solving a system with 1024 
right sides. The factorization of the matrix is not included in timing for solving the system with 
multiple right sides. The speedup is slightly higher for solving multiple right-side systems. 

Because the order of the matrix increases linearly with the number of PEs available, the speedups 
given by figures 10 and 11 are memory-bounded speedup [24]. From table 5.1, the problem size, 
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Figure 11. Speedup over the best sequential algorithm on system with 1024 right sides. 
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in terms of floating-point operations, is a linear function of the order of the matrix. The linear 
memory-bounded speedups given by figures 10 and 11 indicate a linear speed increase. In accordance 
with the isospeed metric of scalability [25], the SPP algorithm is perfectly scalable on the SIMD 
MasPar machine. The reader may refer to [25] for more information regarding scalability of parallel 
algorithm- machine combinations. 

5.2 Vector Computing 

Vector computing is widely used at national laboratories, universities, and supercomputing cen- 
ters for large-scale computing applications. For this reason, a CRAY-2S/4-128 at NASA Langley 
Research Center was also used to test the SPP algorithm on a vector machine. The Cray-2 no- 
tation “S” indicates that the memory is static rather than dynamic, and “4-128” indicates that 
the machine has 4 processors and 128 million 64-bit words of central memory. Each CPU is a 
register-to-register vector processor with a 4.1 nsec minor cycle clock that can generate 100-300 
megaflops. The four processors can be used for a single problem (multi-tasking) to achieve over 1 
gigaflop of performance. 

The speed of a vector machine depends on the vector length, vector stride, and the computa- 
tional richness of the loops. Because the vector register length is 64 and the CPU is extremely 
fast in carrying out floating-point operations, once operands are in the registers, best performance 
can be obtained with loop which have lengths that are multiples of 64, which are computationally 
intensive, and which use unit stride (separation of memory between elements). 

The chosen sample tridiagonal matrices are symmetric Toeplitz and correspond to the first and 
second derivative compact- difference operators (2) and (3). The diagonals and necessary coefficients 
are: 

As = [1, 4, 1] with a, b = 2 =b y/3 

A 2 = [1, 10, 1] with a, b = 5 ± 2a/6 

For the first experiment on the Cray, single tridiagonal solves were made. The test problem 
used here and in the rest of this section corresponds to f(x) = 3x 3 — 2x + 1 , which has smooth 
exact derivatives f r (x) = 9x 2 — 2 and f n (x) = 18x. Figure 12 shows the performance of the 
SPP in terms of CPU seconds and the matrix order compared with the LU decomposition for 
computing /' and /". In addition to the reduced memory requirements of SPP compared to LU, 
the performance shown in figure 12 clearly indicated that the SPP is faster on the vector machine 
than the conventional LU solver; the benefits increase with the operator size. The significant 
difference between the SPP and LU timing can be explained in light of vector operations versus 
scalar operations. The SPP approach can be vectorized over the direction of the solve; the LU 
approach must use scalar operations. For the SPP approach, note that the diagonal dominance of 
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the second-derivative operator /" leads to faster computations compared with the first- derivative 
operator /'. This time reduction results from the truncation of the SPP approach to obtain a 
predetermined level of error (1CU 14 ), which is essentially machine precision. For the first- derivative 
operator (A3), k = 32 = 2 6 and k\ = 24; for the second- derivative operator (A2), k = 16 = 2 4 and 
k\ = 16, where k = 2 k and k\ are the truncation numbers on the solving and modification phases, 
respectively. 



Figure 12. Timing of SPP and LU algorithms: single system. 

Real applications which use compact- difference operators require many tridiagonal solves that 
correspond to time-marching algorithms and involve many right sides corresponding to the mul- 
tidimensionality of the application. In this second evaluation, with the same accuracy 10 -14 , the 
performance of the SPP is compared with LU for multiple right sides. Shown in figure 13 are 
CPU times for the SPP and LU for various orders of the second- derivative compact operator A 2 . 
(Similar results were obtained with the operator A3 but are not shown.) For applications that 
use small operators ( N < 96), the LU solver is more efficient than SPP; for applications that use 
large operators ( N > 96), the SPP is much cheaper than the LU approach. This difference occurs 
because the LU approach vectorizes the do loop associated with the number of right sides, and 
the SPP vectorizes in the direction of the tridiagonal solve. With some creative programming, one 
could potentially vectorize the entire SPP approach with a single array, while the LU approach can 
vectorize over the right-side arrays. 

In the final experiment, the ability of SPP to control truncation error is demonstrated. The 
highest order of accuracy in the solution is based on the truncation error of the compact- difference 
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Figure 13. Timing of SPP and LU algorithms: multiple right sides. 

approaches in equations (2) and (3). As a result, to require machine-zero is overkill for the compact 
solver and leads to unnecessary computational cost. By using the inequality (38), the choice of 
truncation can be determined based on a desired error bound. Figure 14 shows the SPP results 
of truncations k = 3 and k = 5, which correspond to errors 10 -5 and 10 -14 , respectively. If the 
accuracy of the SPP is relaxed, the computational cost decreases by a factor of 2. 

6 Conclusion 

A central goal of parallel processing is to achieve better, more accurate solutions. Because obtain- 
ing more accurate solutions, in general, means adding more discretization points, larger systems 
result and require greater computational power. The accuracy of a simulation solution is also 
bounded by the discretization scheme used. A clear requirement for obtaining a more accurate 
solution is to adopt discretization methods with high-order accuracy. Previously, a highly accurate 
discretization scheme, the compact finite- difference scheme [15], has been proposed. However, the 
almost symmetric Toeplitz tridiagonal systems that arise from compact schemes are sequential in 
nature and difficult to solve efficiently on parallel computers. In this paper, we have introduced 
a parallel algorithm, the simple parallel prefix (SPP) algorithm, for compact schemes and other 
related schemes. 

The SPP algorithm is designed for fine-grain computing. With n processors, the SPP algorithm 
solves an n-dimensional system with 21og(n) + 1 AXPY operations. Two prefix communications 
are required in the solving phase and one broadcast communication is required in the modification 
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Figure 14. Timing of SPP with different accuracies. 

phase. In comparison with existing tridiagonal solvers [19, 10], the SPP algorithm is simple in 
computing and simple in communication. It requires storage of only one log(n)-dimensional vector 
for the computing phase and one n-dimensional vector for the modification phase. When the 
tridiagonal system is diagonally dominant, both the computing and the modification phases can 
be truncated without degrading the accuracy. Memory requirements will be further reduced when 
truncation is applied. A detailed accuracy analysis has been conducted to find the appropriate 
truncation number. Experimental results show that the SPP algorithm achieves a speedup greater 
than 1000 over the best sequential algorithm on a 16K PEs MasPar M-l SIMD parallel computer. 
In addition to the good performance on the SIMD machines, the SPP algorithm also out performs 
the best sequential algorithm on a vector machine (Cray 2), even on systems with multiple right 
sides. Experimental and theoretical results show that the SPP algorithm is a good choice for 
compact schemes and for the emerging high-performance parallel computers. 

The SPP algorithm is designed for symmetric and almost symmetric Toeplitz tridiagonal sys- 
tems. It is a good candidate for compact schemes, alternating direction implicit method, wavelet 
collocation method, spline curve fitting, and many other scientific applications. It can be modified 
for different boundary conditions and for cases where the number of processors p is less than the 
dimension of the system. However, generalization of the algorithm for general tridiagonal systems 
or for band systems is unlikely. 

The work presented in this paper is a continuation of efforts to design efficient parallel solvers 
for compact scheme. An efficient solver, the PDD algorithm, for coarse- or median-grain computing 
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has been proposed [21]. The PDD algorithm and the SPP algorithm can be combined on parallel 
machines with vector processing units. 
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